CBOW News Embeddings#
This notebook walks through building CBOW embeddings with Word2Vec to analyze news text. CBOW is one of the two Word2Vec architectures: it predicts a target word from the context words surrounding it.
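As a concrete illustration (not part of the notebook's pipeline), the context-to-target pairs that CBOW trains on can be sketched in plain Python; the `cbow_pairs` helper and the sample sentence are made up for this sketch:

```python
# Illustrative only: how CBOW forms (context, target) training pairs.
def cbow_pairs(tokens, window=2):
    pairs = []
    for i, target in enumerate(tokens):
        # Neighbors within `window` positions on each side, excluding the target itself
        context = tokens[max(0, i - window):i] + tokens[i + 1:i + 1 + window]
        pairs.append((context, target))
    return pairs

sentence = ['wakil', 'ketua', 'dpr', 'ungkap', 'isi']
for context, target in cbow_pairs(sentence, window=2):
    print(context, '->', target)
```

During training, the model averages the context-word vectors and uses them to predict the target; this is what makes CBOW fast on frequent words.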
Goals:#
Build vector embeddings for the words in the news dataset
Train Word2Vec with the CBOW architecture
Extract numeric features from the text for further analysis
1. Library Installation#
Install the required libraries:
plotly: for interactive visualization
gensim: the core library for Word2Vec and embeddings
[1]:
%%capture
!pip install plotly
!pip install --upgrade gensim
2. Imports and Data Loading#
Import the required libraries and load the preprocessed news dataset:
gensim.models: Word2Vec and FastText
pandas: data manipulation
sklearn.decomposition.PCA: dimensionality reduction
matplotlib and plotly: visualization
numpy: numerical operations
[2]:
from gensim.models import Word2Vec, FastText
import pandas as pd
import re
from sklearn.decomposition import PCA
from matplotlib import pyplot as plt
import plotly.graph_objects as go
import numpy as np
import warnings
warnings.filterwarnings('ignore')
df = pd.read_csv('hasil_preprocessing_berita.csv')
df
[2]:
| isi | hasil_preprocessing | kategori | |
|---|---|---|---|
| 0 | Wakil Ketua DPRSufmi Dasco Ahmadmengungkap isi... | ['wakil', 'ketua', 'dprsufmi', 'dasco', 'ahmad... | nasional |
| 1 | Jaksa Penuntut Umum (JPU) menuntut majelis hak... | ['jaksa', 'tuntut', 'jpu', 'tuntut', 'majelis'... | nasional |
| 2 | Menteri KebudayaanFadli Zonmembeberkan rencana... | ['menteri', 'kebudayaanfadli', 'zonmembeberkan... | nasional |
| 3 | Sebanyak tiga purnawirawan TNI/Polri bergabung... | ['purnawirawan', 'tnipolri', 'gabung', 'dalamk... | nasional |
| 4 | AktorAmmar Zonikembali terlibat dalam kasus pe... | ['aktorammar', 'zonikembali', 'libat', 'edar',... | nasional |
| ... | ... | ... | ... |
| 1595 | Selandia Barubaru saja mengumumkan bahwa negar... | ['selandia', 'barubaru', 'umum', 'negara', 'lo... | gaya-hidup |
| 1596 | Gagal ginjaltermasuk penyakit serius yang dapa... | ['gagal', 'ginjaltermasuk', 'sakit', 'serius',... | gaya-hidup |
| 1597 | Kasuskeracunanusai menyantap programmakan berg... | ['kasuskeracunanusai', 'santap', 'programmakan... | gaya-hidup |
| 1598 | Berkesenian jadi sarana para pejuangkankerpayu... | ['nian', 'sarana', 'pejuangkankerpayudara', 't... | gaya-hidup |
| 1599 | Atlet ski Polandia, Andrzej Bargiel mencatat s... | ['atlet', 'ski', 'polandia', 'andrzej', 'bargi... | gaya-hidup |
1600 rows × 3 columns
3. Custom Class Definitions#
MyTokenizer#
A simple text tokenizer:
Lowercases the text
Splits on whitespace
MeanEmbeddingVectorizer#
Converts a text into an embedding vector:
Uses the trained Word2Vec model
Averages the word vectors of each document
Handles out-of-vocabulary words (falls back to a zero vector)
[3]:
from gensim.models import Word2Vec
4. Text Preprocessing#
Clean the news text by:
Lowercasing: normalizes the text format
Removing punctuation: strips punctuation and non-alphabetic characters
Removing HTML tags: cleans up any leftover markup
Removing digits and special characters
The cleaned text is stored in the 'clean' column.
[4]:
import numpy as np

class MyTokenizer:
    def fit_transform(self, texts):
        # Simple tokenization: lowercase + whitespace split
        return [str(text).lower().split() for text in texts]

class MeanEmbeddingVectorizer:
    def __init__(self, word2vec_model):
        self.word2vec = word2vec_model
        # Use vector_size (Gensim >= 4.0 API)
        self.dim = word2vec_model.wv.vector_size

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        X_tokenized = MyTokenizer().fit_transform(X)
        embeddings = []
        for words in X_tokenized:
            # Keep vectors only for in-vocabulary words
            valid_vectors = [
                self.word2vec.wv[word] for word in words
                if word in self.word2vec.wv
            ]
            if valid_vectors:
                embeddings.append(np.mean(valid_vectors, axis=0))
            else:
                embeddings.append(np.zeros(self.dim))
        return np.array(embeddings)

    def fit_transform(self, X, y=None):
        return self.transform(X)
5. Corpus Construction and Word2Vec Training#
Corpus construction#
Split each cleaned text into a list of words
Each document becomes its own word list
Word2Vec training#
Architecture: CBOW (Word2Vec's default)
min_count=1: keep every word, even those that occur only once
vector_size=56: 56-dimensional embedding vectors
The model learns a vector representation for each word from its contexts
[5]:
clean_txt = []
for w in range(len(df['hasil_preprocessing'])):
    desc = str(df['hasil_preprocessing'][w]).lower()
    # remove punctuation
    desc = re.sub(r'[^a-zA-Z]', ' ', desc)
    # remove tags
    desc = re.sub(r'</?.*?>', ' <> ', desc)
    # remove digits and special chars
    desc = re.sub(r'(\d|\W)+', ' ', desc)
    clean_txt.append(desc)
df['clean'] = clean_txt
df.head()
[5]:
| isi | hasil_preprocessing | kategori | clean | |
|---|---|---|---|---|
| 0 | Wakil Ketua DPRSufmi Dasco Ahmadmengungkap isi... | ['wakil', 'ketua', 'dprsufmi', 'dasco', 'ahmad... | nasional | wakil ketua dprsufmi dasco ahmadmengungkap is... |
| 1 | Jaksa Penuntut Umum (JPU) menuntut majelis hak... | ['jaksa', 'tuntut', 'jpu', 'tuntut', 'majelis'... | nasional | jaksa tuntut jpu tuntut majelis hakim adil ne... |
| 2 | Menteri KebudayaanFadli Zonmembeberkan rencana... | ['menteri', 'kebudayaanfadli', 'zonmembeberkan... | nasional | menteri kebudayaanfadli zonmembeberkan rencan... |
| 3 | Sebanyak tiga purnawirawan TNI/Polri bergabung... | ['purnawirawan', 'tnipolri', 'gabung', 'dalamk... | nasional | purnawirawan tnipolri gabung dalamkomite ekse... |
| 4 | AktorAmmar Zonikembali terlibat dalam kasus pe... | ['aktorammar', 'zonikembali', 'libat', 'edar',... | nasional | aktorammar zonikembali libat edar barang hara... |
6. Exploring the Word2Vec Model#
Word-similarity analysis#
most_similar(): finds the words most similar to a probe word
most_similar_cosmul(): finds words similar to a combination of positive and negative words
doesnt_match(): finds the word that does not belong in a group of words
Saving the embeddings#
Save the word vectors in Word2Vec format
File:
berita_embd.txt (text format, not binary)
[6]:
df.shape
[6]:
(1600, 4)
7. Document Embedding Extraction#
Use MeanEmbeddingVectorizer to turn every document into a vector:
Input: the cleaned document text
Process:
Tokenize the text into words
Look up the Word2Vec embedding for each word
Average the word vectors to get the document representation
Output: a 56-dimensional vector per document
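A minimal numeric sketch of the averaging step, using two hypothetical 4-dimensional word vectors (the real model uses 56 dimensions, and 'oov' stands in for any out-of-vocabulary token):

```python
import numpy as np

# Hypothetical 4-dimensional word vectors for illustration.
word_vectors = {
    'jaksa':  np.array([0.2, -0.1, 0.5, 0.0]),
    'tuntut': np.array([0.4,  0.3, 0.1, 0.2]),
}
doc = ['jaksa', 'tuntut', 'oov']  # 'oov' is skipped: not in the vocabulary
valid = [word_vectors[w] for w in doc if w in word_vectors]
# Average the in-vocabulary vectors; fall back to zeros for an empty document
doc_vector = np.mean(valid, axis=0) if valid else np.zeros(4)
print(doc_vector)  # [0.3 0.1 0.3 0.1]
```

This mirrors the `transform` logic of `MeanEmbeddingVectorizer`: unknown words are dropped rather than breaking the lookup.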
[7]:
corpus = []
for col in df.clean:
    word_list = col.split(" ")
    corpus.append(word_list)
# show first value
corpus[0:1]
# generate vectors from corpus
model = Word2Vec(corpus, min_count=1, vector_size=56)
8. Embedding Validation#
Check the embedding lengths for consistency:
Every document must have a vector of length 56 (matching vector_size)
This confirms the embedding step ran correctly
[8]:
# Explore embeddings safely using an in-vocabulary token
# Pick a common Indonesian token if available, else fall back to the first vocab token
candidate_tokens = ['indonesia', 'pemerintah', 'jakarta', 'presiden', 'ekonomi']
probe = None
for tok in candidate_tokens:
    if tok in model.wv:
        probe = tok
        break
if probe is None:
    probe = model.wv.index_to_key[0]
print('Probe token:', probe)
print('Top similar:')
print(model.wv.most_similar(probe)[:10])
# Optional: cosine-mul example if the tokens exist
pos = [t for t in ['pemerintah', 'indonesia'] if t in model.wv]
neg = [t for t in ['oposisi'] if t in model.wv]
if pos:
    print('Cosmul example:')
    print(model.wv.most_similar_cosmul(positive=pos, negative=neg)[:10])
# Optional: doesnt_match example when enough tokens exist
cands = [t for t in ['ekonomi', 'politik', 'olahraga', 'jakarta'] if t in model.wv]
if len(cands) >= 3:
    print('Odd-one-out:')
    print(model.wv.doesnt_match(cands))
# Save embeddings
filename = 'berita_embd.txt'
model.wv.save_word2vec_format(filename, binary=False)
Probe token: indonesia
Top similar:
[('irak', 0.9516466856002808), ('laga', 0.9405156970024109), ('vs', 0.9328713417053223), ('tanding', 0.9296777248382568), ('kalah', 0.9205594062805176), ('stadion', 0.9113062620162964), ('lawan', 0.9101036787033081), ('menang', 0.9085850119590759), ('timnas', 0.9003143906593323), ('sengit', 0.8951948285102844)]
Cosmul example:
[('antarklub', 2.581584930419922), ('globel', 2.3720037937164307), ('gastat', 2.3555715084075928), ('pdi', 2.172687530517578), ('semringah', 2.083733320236206), ('sekam', 2.081641435623169), ('pasirarab', 2.040635108947754), ('prarekam', 1.9726847410202026), ('laporanarriyadiyah', 1.954675555229187), ('schengen', 1.9542142152786255)]
Odd-one-out:
jakarta
9. Conversion to a DataFrame#
Turn the embedding array into a DataFrame with one column per dimension:
Input: a 2-D embedding array (1600 documents × 56 features)
Process:
Create columns f1, f2, …, f56, one per dimension
Fill each column with the values from the corresponding dimension
Output: a DataFrame with 1600 rows and 56 feature columns
Purpose: makes the data easier to analyze and visualize
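For comparison, the same reshaping can be done in one step with `np.vstack`; this sketch uses a hypothetical two-row Series (`arrays`) in place of the real `df['array']`:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for df['array']: a Series of equal-length embedding arrays.
arrays = pd.Series([np.array([0.1, 0.2]), np.array([0.3, 0.4])])

# Stack the arrays into a 2-D matrix and name the columns f1..fN,
# mirroring the notebook's naming convention.
matrix = np.vstack(arrays.to_numpy())
toy_embedding_df = pd.DataFrame(matrix, columns=[f'f{i+1}' for i in range(matrix.shape[1])])
print(toy_embedding_df.shape)  # (2, 2)
```

The column-by-column dictionary approach below is equivalent; the stacked version simply avoids the inner Python loop.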
10. Embedding Visualization#
Visualizations for analyzing the embeddings:
PCA visualization: dimensionality reduction to 2-D
Similarity heatmap: document-to-document similarity matrix
Embedding distribution: distribution of embedding values
Category analysis: embeddings broken down by category
[9]:
mean_embedding_vectorizer = MeanEmbeddingVectorizer(model)
mean_embedded = mean_embedding_vectorizer.fit_transform(df['clean'])
11. Adding Labels (Optional)#
Try to attach a label column if one is available:
Look for a 'kategori' column in the original DataFrame
If found, copy the labels into the embedding DataFrame
If not, print a warning
Note: labels are needed for supervised learning or model evaluation.
[10]:
df['array']=list(mean_embedded)
12. Final Results#
Process summary:#
Preprocessing: cleaned the news text
Word2Vec training: built a 56-dimensional CBOW model
Embedding extraction: converted documents into numeric vectors
DataFrame conversion: reshaped the embedding array into tabular form
Output:#
Embedding DataFrame: 1600 rows × 56 feature columns
Embedding file: berita_embd.txt (Word2Vec format)
Word2Vec model: ready for word-similarity analysis
Possible next steps:#
Document clustering
Text classification
Document-similarity analysis
Embedding visualization with PCA/t-SNE
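As one possible follow-up, document clustering could look like the sketch below; the random matrix stands in for the real 1600 × 56 embedding features and the cluster count is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.cluster import KMeans

# Random vectors stand in for the real embedding matrix (1600 docs x 56 dims).
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 56))

# Cluster the document vectors; n_clusters=5 is arbitrary here.
kmeans = KMeans(n_clusters=5, n_init=10, random_state=0)
labels = kmeans.fit_predict(X)
print(labels.shape)  # (100,)
```

On the real data, `X` would be `embedding_df` restricted to the f1..f56 columns, and the labels could be compared against the 'kategori' column.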
[11]:
df.head(5)
[11]:
| isi | hasil_preprocessing | kategori | clean | array | |
|---|---|---|---|---|---|
| 0 | Wakil Ketua DPRSufmi Dasco Ahmadmengungkap isi... | ['wakil', 'ketua', 'dprsufmi', 'dasco', 'ahmad... | nasional | wakil ketua dprsufmi dasco ahmadmengungkap is... | [-0.3047359, 0.2068249, -0.5436013, -0.0116253... |
| 1 | Jaksa Penuntut Umum (JPU) menuntut majelis hak... | ['jaksa', 'tuntut', 'jpu', 'tuntut', 'majelis'... | nasional | jaksa tuntut jpu tuntut majelis hakim adil ne... | [-0.3274044, 0.31003174, -0.45369172, -0.18807... |
| 2 | Menteri KebudayaanFadli Zonmembeberkan rencana... | ['menteri', 'kebudayaanfadli', 'zonmembeberkan... | nasional | menteri kebudayaanfadli zonmembeberkan rencan... | [-0.45348528, 0.097648606, -0.5087352, -0.1018... |
| 3 | Sebanyak tiga purnawirawan TNI/Polri bergabung... | ['purnawirawan', 'tnipolri', 'gabung', 'dalamk... | nasional | purnawirawan tnipolri gabung dalamkomite ekse... | [-0.41289213, 0.14825235, -0.47201398, -0.0513... |
| 4 | AktorAmmar Zonikembali terlibat dalam kasus pe... | ['aktorammar', 'zonikembali', 'libat', 'edar',... | nasional | aktorammar zonikembali libat edar barang hara... | [-0.32000566, 0.1425625, -0.40857485, -0.08050... |
[12]:
df['embedding_length'] = df['array'].str.len()
[13]:
print(df['embedding_length'])
0 56
1 56
2 56
3 56
4 56
..
1595 56
1596 56
1597 56
1598 56
1599 56
Name: embedding_length, Length: 1600, dtype: int64
[14]:
df.shape
[14]:
(1600, 6)
[15]:
num_features = len(df['array'].iloc[0])  # assumes every list has the same length
columns = [f'f{i+1}' for i in range(num_features)]
# Initialize a dictionary to hold the data per column
data_dict = {col: [] for col in columns}
# Loop over every row of the 'array' column
for embedding_list in df['array']:
    for i, value in enumerate(embedding_list):
        data_dict[f'f{i+1}'].append(value)
# Build a DataFrame from the dictionary
embedding_df = pd.DataFrame(data_dict)
embedding_df
[15]:
| f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f47 | f48 | f49 | f50 | f51 | f52 | f53 | f54 | f55 | f56 | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.304736 | 0.206825 | -0.543601 | -0.011625 | -0.232782 | -0.388652 | 0.054216 | -0.760374 | -0.606923 | 0.342078 | ... | -0.396228 | 0.917827 | -0.318837 | 0.530248 | 0.180493 | -0.152448 | -0.049415 | 0.445056 | 0.322519 | 0.244274 |
| 1 | -0.327404 | 0.310032 | -0.453692 | -0.188074 | -0.441716 | -0.539700 | 0.245300 | -0.748373 | -0.576681 | 0.264929 | ... | -0.607089 | 0.886531 | -0.420482 | 0.532956 | 0.348477 | -0.433654 | -0.194202 | 0.557776 | 0.272571 | 0.426018 |
| 2 | -0.453485 | 0.097649 | -0.508735 | -0.101847 | -0.211884 | -0.513851 | 0.162222 | -0.640414 | -0.710195 | 0.374572 | ... | -0.486324 | 0.915984 | -0.294784 | 0.421133 | 0.197899 | -0.147033 | 0.049050 | 0.475986 | 0.325635 | 0.217638 |
| 3 | -0.412892 | 0.148252 | -0.472014 | -0.051390 | -0.228315 | -0.435257 | 0.107734 | -0.615625 | -0.602683 | 0.334302 | ... | -0.453517 | 0.855925 | -0.282648 | 0.414384 | 0.212581 | -0.119039 | -0.004500 | 0.429551 | 0.295338 | 0.238155 |
| 4 | -0.320006 | 0.142562 | -0.408575 | -0.080500 | -0.233560 | -0.385516 | 0.067452 | -0.609521 | -0.533253 | 0.278045 | ... | -0.399375 | 0.760419 | -0.290797 | 0.430208 | 0.198839 | -0.167886 | -0.088655 | 0.386003 | 0.224808 | 0.232660 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1595 | -0.587156 | 0.204012 | -0.490395 | -0.127514 | -0.309185 | -0.620009 | 0.140613 | -0.433143 | -0.781598 | 0.302365 | ... | -0.557675 | 0.745641 | -0.371193 | 0.483341 | 0.242069 | -0.205675 | -0.026062 | 0.518818 | 0.316863 | 0.391160 |
| 1596 | -0.614807 | 0.037704 | -0.583241 | -0.160989 | -0.198819 | -0.647895 | 0.338173 | -0.438555 | -0.894842 | 0.298752 | ... | -0.451434 | 0.887128 | -0.400087 | 0.477838 | 0.114502 | -0.158484 | 0.157055 | 0.429494 | 0.412823 | 0.342922 |
| 1597 | -0.529503 | 0.046499 | -0.602205 | -0.170756 | -0.211956 | -0.641523 | 0.335278 | -0.565206 | -0.877382 | 0.365037 | ... | -0.462486 | 0.973211 | -0.342466 | 0.469883 | 0.170135 | -0.173682 | 0.149892 | 0.438381 | 0.429161 | 0.238733 |
| 1598 | -0.341069 | 0.060413 | -0.360465 | -0.086003 | -0.159161 | -0.405737 | 0.164194 | -0.489829 | -0.546824 | 0.226861 | ... | -0.341259 | 0.643250 | -0.235736 | 0.313273 | 0.148872 | -0.105729 | 0.011869 | 0.355090 | 0.235377 | 0.206914 |
| 1599 | -0.436698 | 0.094790 | -0.381166 | -0.161527 | -0.197959 | -0.461096 | 0.248350 | -0.368515 | -0.564194 | 0.225465 | ... | -0.466212 | 0.662952 | -0.251962 | 0.294835 | 0.163671 | -0.136620 | 0.070253 | 0.385139 | 0.270072 | 0.225602 |
1600 rows × 56 columns
[16]:
# Attach a label column if one is available on df
possible_labels = ['kategori']
label_col = None
for c in possible_labels:
    if c in df.columns:
        label_col = c
        break
if label_col is not None:
    embedding_df[label_col] = df[label_col].values
else:
    print('Warning: no label column found in df. Skipping label copy.')
[17]:
embedding_df
[17]:
| f1 | f2 | f3 | f4 | f5 | f6 | f7 | f8 | f9 | f10 | ... | f48 | f49 | f50 | f51 | f52 | f53 | f54 | f55 | f56 | kategori | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | -0.304736 | 0.206825 | -0.543601 | -0.011625 | -0.232782 | -0.388652 | 0.054216 | -0.760374 | -0.606923 | 0.342078 | ... | 0.917827 | -0.318837 | 0.530248 | 0.180493 | -0.152448 | -0.049415 | 0.445056 | 0.322519 | 0.244274 | nasional |
| 1 | -0.327404 | 0.310032 | -0.453692 | -0.188074 | -0.441716 | -0.539700 | 0.245300 | -0.748373 | -0.576681 | 0.264929 | ... | 0.886531 | -0.420482 | 0.532956 | 0.348477 | -0.433654 | -0.194202 | 0.557776 | 0.272571 | 0.426018 | nasional |
| 2 | -0.453485 | 0.097649 | -0.508735 | -0.101847 | -0.211884 | -0.513851 | 0.162222 | -0.640414 | -0.710195 | 0.374572 | ... | 0.915984 | -0.294784 | 0.421133 | 0.197899 | -0.147033 | 0.049050 | 0.475986 | 0.325635 | 0.217638 | nasional |
| 3 | -0.412892 | 0.148252 | -0.472014 | -0.051390 | -0.228315 | -0.435257 | 0.107734 | -0.615625 | -0.602683 | 0.334302 | ... | 0.855925 | -0.282648 | 0.414384 | 0.212581 | -0.119039 | -0.004500 | 0.429551 | 0.295338 | 0.238155 | nasional |
| 4 | -0.320006 | 0.142562 | -0.408575 | -0.080500 | -0.233560 | -0.385516 | 0.067452 | -0.609521 | -0.533253 | 0.278045 | ... | 0.760419 | -0.290797 | 0.430208 | 0.198839 | -0.167886 | -0.088655 | 0.386003 | 0.224808 | 0.232660 | nasional |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 1595 | -0.587156 | 0.204012 | -0.490395 | -0.127514 | -0.309185 | -0.620009 | 0.140613 | -0.433143 | -0.781598 | 0.302365 | ... | 0.745641 | -0.371193 | 0.483341 | 0.242069 | -0.205675 | -0.026062 | 0.518818 | 0.316863 | 0.391160 | gaya-hidup |
| 1596 | -0.614807 | 0.037704 | -0.583241 | -0.160989 | -0.198819 | -0.647895 | 0.338173 | -0.438555 | -0.894842 | 0.298752 | ... | 0.887128 | -0.400087 | 0.477838 | 0.114502 | -0.158484 | 0.157055 | 0.429494 | 0.412823 | 0.342922 | gaya-hidup |
| 1597 | -0.529503 | 0.046499 | -0.602205 | -0.170756 | -0.211956 | -0.641523 | 0.335278 | -0.565206 | -0.877382 | 0.365037 | ... | 0.973211 | -0.342466 | 0.469883 | 0.170135 | -0.173682 | 0.149892 | 0.438381 | 0.429161 | 0.238733 | gaya-hidup |
| 1598 | -0.341069 | 0.060413 | -0.360465 | -0.086003 | -0.159161 | -0.405737 | 0.164194 | -0.489829 | -0.546824 | 0.226861 | ... | 0.643250 | -0.235736 | 0.313273 | 0.148872 | -0.105729 | 0.011869 | 0.355090 | 0.235377 | 0.206914 | gaya-hidup |
| 1599 | -0.436698 | 0.094790 | -0.381166 | -0.161527 | -0.197959 | -0.461096 | 0.248350 | -0.368515 | -0.564194 | 0.225465 | ... | 0.662952 | -0.251962 | 0.294835 | 0.163671 | -0.136620 | 0.070253 | 0.385139 | 0.270072 | 0.225602 | gaya-hidup |
1600 rows × 57 columns
[18]:
embedding_df.shape
[18]:
(1600, 57)
[19]:
# 1. PCA visualization of the embeddings
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import plotly.express as px
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Reduce dimensionality with PCA
pca = PCA(n_components=2)
embedding_2d = pca.fit_transform(embedding_df.iloc[:, :-1])  # exclude the kategori column

# Plot with Matplotlib
plt.figure(figsize=(12, 8))
categories = embedding_df['kategori'].unique()
colors = ['red', 'blue', 'green', 'orange', 'purple']
for i, category in enumerate(categories):
    mask = embedding_df['kategori'] == category
    plt.scatter(embedding_2d[mask, 0], embedding_2d[mask, 1],
                c=colors[i % len(colors)], label=category, alpha=0.7, s=50)
plt.xlabel(f'PC1 ({pca.explained_variance_ratio_[0]:.2%} variance)')
plt.ylabel(f'PC2 ({pca.explained_variance_ratio_[1]:.2%} variance)')
plt.title('PCA Visualization of News Embeddings by Category')
plt.legend()
plt.grid(True, alpha=0.3)
plt.tight_layout()
plt.show()

print(f"Explained variance ratio: PC1={pca.explained_variance_ratio_[0]:.3f}, PC2={pca.explained_variance_ratio_[1]:.3f}")
print(f"Total explained variance: {pca.explained_variance_ratio_.sum():.3f}")
Explained variance ratio: PC1=0.371, PC2=0.217
Total explained variance: 0.587
[20]:
# 3. Similarity heatmap for a sample of documents
# Take a sample of 20 documents for the heatmap
sample_size = min(20, len(embedding_df))
sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
sample_embeddings = embedding_df.iloc[sample_indices, :-1]  # exclude kategori

# Compute cosine similarity
similarity_matrix = cosine_similarity(sample_embeddings)

# Plot the heatmap with Matplotlib
plt.figure(figsize=(10, 8))
plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
plt.colorbar(label='Cosine Similarity')
plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
plt.xlabel('Document Index')
plt.ylabel('Document Index')

# Add category labels
categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
for i, cat in enumerate(categories_sample):
    plt.text(i, -0.5, cat[:3], rotation=45, ha='right', va='top', fontsize=8)
plt.tight_layout()
plt.show()

print(f"Similarity matrix shape: {similarity_matrix.shape}")
print(f"Average similarity: {similarity_matrix.mean():.3f}")
print(f"Max similarity: {similarity_matrix.max():.3f}")
print(f"Min similarity: {similarity_matrix.min():.3f}")
Similarity matrix shape: (20, 20)
Average similarity: 0.964
Max similarity: 1.000
Min similarity: 0.871
[21]:
# 4. Embedding distribution per category
fig, axes = plt.subplots(2, 2, figsize=(15, 10))
axes = axes.ravel()

# Pick a few features to analyze
feature_cols = ['f1', 'f2', 'f3', 'f4']
for i, feature in enumerate(feature_cols):
    for category in embedding_df['kategori'].unique():
        data = embedding_df[embedding_df['kategori'] == category][feature]
        axes[i].hist(data, alpha=0.6, label=category, bins=20)
    axes[i].set_title(f'Distribution of {feature} by Category')
    axes[i].set_xlabel(feature)
    axes[i].set_ylabel('Frequency')
    axes[i].legend()
    axes[i].grid(True, alpha=0.3)
plt.tight_layout()
plt.show()
[22]:
# 5. Word-similarity analysis with Word2Vec
# Take a few words from the vocabulary
vocab_words = list(model.wv.key_to_index.keys())[:20]  # first 20 words

# Compute the similarity matrix between words
word_similarities = []
for word1 in vocab_words:
    row = []
    for word2 in vocab_words:
        if word1 in model.wv and word2 in model.wv:
            row.append(model.wv.similarity(word1, word2))
        else:
            row.append(0)
    word_similarities.append(row)
word_similarities = np.array(word_similarities)

# Plot the word-similarity heatmap
plt.figure(figsize=(12, 10))
plt.imshow(word_similarities, cmap='viridis', aspect='auto')
plt.colorbar(label='Word Similarity')
plt.title('Word Similarity Matrix (Word2Vec)')
plt.xlabel('Words')
plt.ylabel('Words')
# Set labels
plt.xticks(range(len(vocab_words)), vocab_words, rotation=45, ha='right')
plt.yticks(range(len(vocab_words)), vocab_words)
plt.tight_layout()
plt.show()

print(f"Vocabulary size: {len(model.wv.key_to_index)}")
print(f"Sample words: {vocab_words[:10]}")
Vocabulary size: 23603
Sample words: ['', 'indonesia', 'to', 'with', 'content', 'scroll', 'continue', 'advertisement', 'jalan', 'milik']
[23]:
# Test Plotly after installing nbformat
import plotly.express as px
import pandas as pd
import numpy as np

# Build simple test data
test_data = pd.DataFrame({
    'x': np.random.randn(10),
    'y': np.random.randn(10),
    'category': ['A', 'B', 'A', 'C', 'B', 'A', 'C', 'B', 'A', 'C']
})

# Test plotly
fig = px.scatter(test_data, x='x', y='y', color='category', title='Test Plotly')
fig.show()
print("✅ Plotly ran successfully! The nbformat error is resolved.")
Data type cannot be displayed: application/vnd.plotly.v1+json
✅ Plotly ran successfully! The nbformat error is resolved.
[24]:
# Fix 2: reinstall the libraries from inside the notebook
import sys
!{sys.executable} -m pip install --upgrade nbformat ipython
Requirement already satisfied: nbformat in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (5.10.4)
Requirement already satisfied: ipython in c:\users\user\appdata\roaming\python\python311\site-packages (9.6.0)
Requirement already satisfied: fastjsonschema>=2.15 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (2.21.2)
Requirement already satisfied: jsonschema>=2.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from nbformat) (4.25.1)
Requirement already satisfied: jupyter-core!=5.0.*,>=4.12 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.8.1)
Requirement already satisfied: traitlets>=5.1 in c:\users\user\appdata\roaming\python\python311\site-packages (from nbformat) (5.14.3)
Requirement already satisfied: colorama in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (0.4.6)
Requirement already satisfied: decorator in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (5.2.1)
Requirement already satisfied: ipython-pygments-lexers in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (1.1.1)
Requirement already satisfied: jedi>=0.16 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.19.2)
Requirement already satisfied: matplotlib-inline in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.1.7)
Requirement already satisfied: prompt_toolkit<3.1.0,>=3.0.41 in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (3.0.52)
Requirement already satisfied: pygments>=2.4.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (2.19.2)
Requirement already satisfied: stack_data in c:\users\user\appdata\roaming\python\python311\site-packages (from ipython) (0.6.3)
Requirement already satisfied: typing_extensions>=4.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from ipython) (4.15.0)
Requirement already satisfied: wcwidth in c:\users\user\appdata\roaming\python\python311\site-packages (from prompt_toolkit<3.1.0,>=3.0.41->ipython) (0.2.14)
Requirement already satisfied: parso<0.9.0,>=0.8.4 in c:\users\user\appdata\roaming\python\python311\site-packages (from jedi>=0.16->ipython) (0.8.5)
Requirement already satisfied: attrs>=22.2.0 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (25.3.0)
Requirement already satisfied: jsonschema-specifications>=2023.03.6 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (2025.9.1)
Requirement already satisfied: referencing>=0.28.4 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.36.2)
Requirement already satisfied: rpds-py>=0.7.1 in c:\users\user\appdata\local\programs\python\python311\lib\site-packages (from jsonschema>=2.6->nbformat) (0.27.1)
Requirement already satisfied: platformdirs>=2.5 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (4.4.0)
Requirement already satisfied: pywin32>=300 in c:\users\user\appdata\roaming\python\python311\site-packages (from jupyter-core!=5.0.*,>=4.12->nbformat) (311)
Requirement already satisfied: executing>=1.2.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (2.2.1)
Requirement already satisfied: asttokens>=2.1.0 in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (3.0.0)
Requirement already satisfied: pure-eval in c:\users\user\appdata\roaming\python\python311\site-packages (from stack_data->ipython) (0.2.3)
[25]:
# Fix 3: try a different Plotly renderer
import plotly.io as pio

# Try several renderers in turn
try:
    # Renderer for Jupyter notebooks
    pio.renderers.default = "notebook"
    print("✅ Renderer set to 'notebook'")
except Exception:
    try:
        # Renderer for the browser
        pio.renderers.default = "browser"
        print("✅ Renderer set to 'browser'")
    except Exception:
        # HTML renderer
        pio.renderers.default = "html"
        print("✅ Renderer set to 'html'")

# Test with simple data
import plotly.express as px
import pandas as pd
import numpy as np

test_data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5],
    'y': [2, 4, 1, 3, 5],
    'category': ['A', 'B', 'A', 'C', 'B']
})
fig = px.scatter(test_data, x='x', y='y', color='category', title='Test Plotly with New Renderer')
fig.show()
✅ Renderer set to 'notebook'
[26]:
# Fix for the similarity-heatmap error
# Import the required libraries
from sklearn.metrics.pairwise import cosine_similarity
import matplotlib.pyplot as plt
import numpy as np

# Check whether embedding_df exists
try:
    if 'embedding_df' not in locals():
        print("❌ Error: embedding_df has not been created. Run the earlier cells first.")
    else:
        print(f"✅ embedding_df available with shape: {embedding_df.shape}")
        # Check that the category column exists
        if 'kategori' not in embedding_df.columns:
            print("❌ Error: column 'kategori' not found in embedding_df")
            print(f"Available columns: {list(embedding_df.columns)}")
        else:
            print("✅ Column 'kategori' available")
except NameError as e:
    print(f"❌ Error: {e}")
    print("Make sure all earlier cells ran successfully.")
✅ embedding_df available with shape: (1600, 57)
✅ Column 'kategori' available
[27]:
# Similarity heatmap with proper error handling
def create_similarity_heatmap(embedding_df, sample_size=20):
    """
    Build a similarity heatmap with robust error handling.

    Parameters:
    - embedding_df: DataFrame containing the embeddings
    - sample_size: number of documents to sample for the heatmap (default: 20)
    """
    try:
        # Import the required libraries
        from sklearn.metrics.pairwise import cosine_similarity
        import matplotlib.pyplot as plt
        import numpy as np
        # Check that there is data
        if len(embedding_df) == 0:
            print("❌ Error: embedding_df is empty")
            return None
        # Select the feature columns (exclude non-numeric columns)
        feature_cols = [col for col in embedding_df.columns if col.startswith('f')]
        if len(feature_cols) == 0:
            print("❌ Error: no feature columns (f1, f2, ...) found")
            return None
        print(f"✅ Found {len(feature_cols)} feature columns")
        # Sample documents
        sample_size = min(sample_size, len(embedding_df))
        sample_indices = np.random.choice(len(embedding_df), sample_size, replace=False)
        sample_embeddings = embedding_df.iloc[sample_indices][feature_cols]
        print(f"✅ Sampled {sample_size} documents for the heatmap")
        # Compute cosine similarity
        similarity_matrix = cosine_similarity(sample_embeddings)
        # Plot the heatmap
        plt.figure(figsize=(12, 10))
        plt.imshow(similarity_matrix, cmap='viridis', aspect='auto')
        plt.colorbar(label='Cosine Similarity')
        plt.title('Cosine Similarity Matrix of News Embeddings (Sample)')
        plt.xlabel('Document Index')
        plt.ylabel('Document Index')
        # Add category labels when available
        if 'kategori' in embedding_df.columns:
            categories_sample = embedding_df.iloc[sample_indices]['kategori'].values
            for i, cat in enumerate(categories_sample):
                plt.text(i, -0.5, str(cat)[:3], rotation=45, ha='right', va='top', fontsize=8)
            plt.text(0, -1.5, "Category:", fontsize=10, fontweight='bold')
        else:
            print("⚠️ Column 'kategori' not found; heatmap drawn without category labels")
        plt.tight_layout()
        plt.show()
        # Similarity statistics
        print(f"\n📊 Similarity matrix statistics:")
        print(f"   Shape: {similarity_matrix.shape}")
        print(f"   Average similarity: {similarity_matrix.mean():.3f}")
        print(f"   Max similarity: {similarity_matrix.max():.3f}")
        print(f"   Min similarity: {similarity_matrix.min():.3f}")
        # Average similarity excluding the diagonal (self-similarity)
        mask = ~np.eye(similarity_matrix.shape[0], dtype=bool)
        off_diagonal_similarities = similarity_matrix[mask]
        print(f"   Average similarity (excluding diagonal): {off_diagonal_similarities.mean():.3f}")
        return similarity_matrix
    except Exception as e:
        print(f"❌ Error while building the similarity heatmap: {str(e)}")
        print("Make sure all libraries are installed and the data is ready")
        return None

# Run the function
print("🚀 Building similarity heatmap...")
similarity_matrix = create_similarity_heatmap(embedding_df, sample_size=20)
🚀 Building similarity heatmap...
✅ Found 56 feature columns
✅ Sampled 20 documents for the heatmap
📊 Similarity matrix statistics:
   Shape: (20, 20)
   Average similarity: 0.961
   Max similarity: 1.000
   Min similarity: 0.891
   Average similarity (excluding diagonal): 0.959